In software defect prediction, noisy attributes and high-dimensional data 
remain critical challenges. This paper introduces a novel approach 
known as multi correlation-based feature selection (MCFS), which seeks to 
address these challenges. MCFS integrates two feature selection techniques, 
namely correlation-based feature selection (CFS) and correlation matrix-based 
feature selection (CMFS), intending to reduce data dimensionality and 
eliminate noisy attributes. To accomplish this, CFS and CMFS are applied 
independently to filter the datasets, and a weighted average of their 
outcomes is computed to determine the optimal feature selection. This 
approach not only reduces data dimensionality but also mitigates the impact 
of noisy attributes. To further enhance predictive performance, this paper 
leverages the particle swarm optimization (PSO) algorithm as a feature 
selection mechanism, specifically targeting improvements in the area under 
the curve (AUC). The evaluation of the proposed method is conducted on 12 
benchmark datasets sourced from the NASA metrics data program (MDP) 
corpus, renowned for their noisy attributes, high dimensionality, and 
imbalanced class records. The research findings demonstrate that MCFS 
outperforms CFS and CMFS, yielding an average AUC value of 0.891, 
thereby emphasizing its efficacy in advancing classification performance in 
the context of software defect prediction using k-nearest neighbors (KNN) 
classification. 

1. INTRODUCTION 


In software development, software defect prediction is an important aspect of ensuring the quality 
and reliability of software [1]. Software defect prediction datasets often have noisy attributes [2], 
high dimensionality [3], and imbalanced classes [4]. In practice, these noisy attributes can affect the accuracy of 
the prediction [5]. 


To address the problem of noisy attributes, particle swarm optimization (PSO) has been 
shown to be an effective optimization algorithm [6]. However, PSO also has weaknesses, especially on 
high-dimensional datasets, which are characterized by a large number of attributes and 
records [6], [7]. On such datasets, PSO is prone to premature convergence and is less 
effective at handling noisy attributes: it tends to settle on a suboptimal point in the 
search space without reaching a better solution [5]-[7]. 

To overcome the challenge of high dimensionality in software defect prediction data, several 
filtering techniques can be used, such as correlation-based feature selection (CFS) [8] and correlation matrix-based 
feature selection (CMFS) [8], [9]. These techniques aim to reduce the dimensionality of the data by 
identifying highly correlated attributes and removing attributes that have no significant correlation with the 
target variable. By reducing the dimensionality of the data, the complexity of the calculations can be reduced 
and noisy attributes can be eliminated, thereby improving the accuracy of the prediction [8]. 

Empirical and theoretical evidence suggests that applying filtering methods to reduce 
dimensionality and eliminate noisy attributes has the potential to improve the accuracy of 
the final prediction model across a number of classifiers [10]. However, research applying filtering 
methods to the PSO problem is very limited. Many studies stop at CFS and do not apply CMFS because of the 
multicollinearity problem in CMFS, which causes feature selection to pick attributes that contain noise 
[11]-[13]. 

Multicollinearity occurs when there is a high correlation between several attributes, which can lead 
to problems in the interpretation of results and model stability [12], [13]. To overcome multicollinearity, a 
common approach is to remove highly correlated attributes so that only the most informative attributes that 
have a significant correlation with the target variable are retained. This helps to reduce ambiguity and 
improve the interpretability and stability of the models generated by correlation-based feature selection 
techniques [12]. 

In this research, the PSO method and filtering techniques will be integrated to overcome the 
challenges associated with noisy attributes and high data dimensionality in software defect prediction. This 
research will utilize filtering techniques such as CFS and CMFS to reduce data dimensionality and remove 
noisy attributes. Furthermore, the PSO algorithm will be applied as an optimization method to improve the 
accuracy and area under the curve (AUC) [14] in prediction using k-nearest neighbors (KNN) [15]. KNN is a 
classification algorithm that classifies samples based on the presence of nearby samples in the attribute space 
[16]. KNN is one of the easiest and most effective methods to be used in software defect prediction [15], 
[16]. By integrating these methods, we expect to overcome the problem of noisy attributes, 
reduce the data dimensionality, and improve the quality of software defect prediction. 


2. METHOD 

In the proposed method, a series of experiments was conducted on the NASA metrics data program 
(MDP) datasets to identify the impact of using CFS and CMFS. The output of these filtering stages is then 
optimized through a PSO approach to perform software defect prediction. Figure 1 illustrates the 
proposed model, which integrates CFS and CMFS. 

The process begins with applying CFS and CMFS for feature selection, where a filtering mechanism 
is used to address multicollinearity by removing highly correlated attributes. Following initial filtering, 
features are selected based on the averaged weights of CFS and CMFS. The dataset is then split into 
training and testing subsets. Next, PSO is employed to further refine feature selection and optimize weights. 
In the PSO phase, the filtered training datasets undergo 10-fold cross-validation using KNN as the 
classifier. The outcome is an optimized KNN model, which is then evaluated on the test data. This integrated 
approach enhances feature selection efficiency, leading to improved overall model performance. 


[Figure 1 flowchart stages: correlation-based feature selection; highly correlated attribute filtering; correlation matrix-based feature selection; weight averaging; classification with PSO feature selection; performance evaluation]

Figure 1. Proposed multi correlation-based feature selection (MCFS) classification 

2.1. NASA metrics data program 

For experiment replication and verification, the proposed approach is applied to a set of 12 
benchmark datasets focused on software defects. These datasets are sourced from the NASA corpus, which 
encompasses real software projects across diverse domains and programming languages (C, C++, Java). 
They exhibit variations in code size and include various software metrics. However, it is important to note 
that the NASA corpus is known to contain noisy attributes [17], have high dimensionality [18], and have 
imbalanced class records [19]. For example, the NASA JM1 dataset comprises 7,782 records, 1,672 of which 
contain defects and 6,110 of which do not, each described by 22 attributes. Table 1 summarizes 
general statistics for each of the datasets used. 


Table 1. Datasets specifications 
Datasets  Attributes  Instances  Defects  Non-defects  Defects%  Non-defects%
CM1       38          327        42       285          12.8      87.2
JM1       22          7,782      1,672    6,110        21.5      78.5
KC1       22          1,183      314      869          26.5      73.5
KC3       40          194        36       158          18.6      81.4
MC1       39          1,988      46       1,942        2.3       97.7
MC2       40          125        44       81           35.2      64.8
MW1       38          253        27       226          10.7      89.3
PC1       38          705        61       644          8.7       91.3
PC2       37          745        16       729          2.1       97.9
PC3       38          1,077      134      943          12.4      87.6
PC4       38          1,287      177      1,110        13.8      86.2
PC5       39          1,711      471      1,240        27.5      72.5


2.2. Feature selection 

Handling high-dimensional datasets requires feature selection, which involves choosing the most 
relevant attributes and eliminating redundant or noisy ones. It provides dimensionality reduction and 
improves data quality [20], [21]. Feature selection plays a pivotal role in data analysis, with various methods 
available, including information gain, correlation, and more [22]. This research focuses on correlation-based 
feature selection methods, particularly two prominent approaches: CFS and CMFS. 


2.2.1. Correlation-based feature selection 

The CFS method is the approach used in this research to select a subset of features that are most 
relevant to the target variable. CFS employs the Pearson coefficient to measure the linear 
relationship between each feature and the target variable. The coefficient ranges from -1 to 1, indicating 
positive, negative, or no correlation. The Pearson coefficient is given in (1) [23]: 


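$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{1}$$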

In CFS, features with a high Pearson coefficient with the target are considered highly relevant, 
indicating strong predictive potential [24]. However, CFS does not assess relationships between selected 
features [25]. It prioritizes individual feature-target correlations, aiming to select highly correlated features 
while disregarding less correlated ones, thus mitigating multicollinearity concerns [26]. 
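
The feature-target scoring described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the experiments: the function names, the toy metric names, and the use of the absolute value of the coefficient as the feature weight are all assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def feature_target_weights(columns, target):
    """Map each feature name to |r| between that feature and the target."""
    return {name: abs(pearson(values, target)) for name, values in columns.items()}

# Toy data with hypothetical metric names (not taken from the NASA MDP files).
columns = {
    "loc": [10, 20, 30, 40, 50],
    "comments": [5, 3, 6, 2, 4],
}
defective = [0, 0, 1, 1, 1]

weights = feature_target_weights(columns, defective)
ranked = sorted(weights, key=weights.get, reverse=True)
print(ranked, weights)
```

In this toy case the size metric correlates strongly with the defect label while the comment count does not, so the former is ranked first.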


2.2.2. Highly correlated attribute filtering 

When working with multiple features, multicollinearity, a common issue in statistical analysis, 
poses a significant challenge in correlation-based feature selection. It occurs when multiple dataset features 
display strong intercorrelation, leading to various problems. Multicollinearity obscures individual feature 
contributions to the target variable, hampers model interpretability, and destabilizes parameter estimation. 

In addition, multicollinearity also increases the risk of overfitting, making the model overly complex 
and prone to emphasizing noise over genuine patterns. Furthermore, it complicates diagnosing performance 
issues, making it challenging to determine whether poor model performance is due to multicollinearity or 
other factors [27]. Feature selection methods that suffer from multicollinearity, such as CMFS [26], benefit from this 
filtering. Highly correlated attribute filtering removes one of the variables in each highly correlated pair, typically the one with the 
weakest or least relevant relationship in the model [27]. The filter is applied before CMFS. 
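
A possible form of this filter is sketched below. It is only an illustration under stated assumptions: the 0.95 threshold follows the minCorr value in Algorithm 1, while the rule for choosing which attribute of a pair to drop (keep the one more correlated with the target) and all names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 0.0 if var_x == 0 or var_y == 0 else cov / math.sqrt(var_x * var_y)

def remove_high_corr(columns, target, threshold=0.95):
    """Drop one feature from every pair whose |r| exceeds the threshold."""
    relevance = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    kept = []
    # Visit features from most to least relevant, so that the attribute
    # with the weaker link to the target is the one that gets dropped.
    for name in sorted(columns, key=relevance.get, reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    # Preserve the original column order in the result.
    return {name: vals for name, vals in columns.items() if name in kept}

# Toy data with hypothetical metric names; "loc_executable" nearly duplicates "loc".
columns = {
    "loc": [10, 20, 30, 40, 50],
    "loc_executable": [10, 21, 29, 40, 50],
    "comments": [5, 3, 6, 2, 4],
}
defective = [0, 0, 1, 1, 1]

filtered = remove_high_corr(columns, defective)
print(list(filtered))
```

Here the two size metrics have a pairwise correlation above 0.95, so only the one more strongly related to the defect label survives.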

2.2.3. Correlation matrix-based feature selection 

CMFS is an important technique in data analysis and statistical modeling, enabling algorithms to 
pinpoint the most relevant features. CMFS uses the Pearson coefficient (1) to represent correlations [28]. 
Better feature selection improves algorithmic models and supports prediction and decision-making in diverse 
fields such as business and science [29]. In contrast to CFS, which focuses on the 
correlation between features and the target variable, CMFS takes a broader perspective. CMFS constructs a 
correlation matrix encompassing pairwise correlations among all features. Each matrix entry represents the 
Pearson correlation coefficient for a specific feature pair, offering detailed insights into the interrelationships 
within the entire feature set. 


2.3. Weight averaging 

After the initial dataset filtering, CFS and CMFS each assign a weight to every feature based on its 
correlation significance. These individual weights are then averaged to generate a single score for 
each feature. The resulting average weights serve as indicators of feature importance, with higher scores 
suggesting greater relevance. Features with the highest average weights are then selected for inclusion 
in the refined dataset. 
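
The averaging and selection step can be illustrated as follows. The weights and feature names are made up for the example, and the function names are hypothetical; Algorithm 1 keeps the top k=10 features, whereas k=2 is used here only to keep the toy case small.

```python
def average_weights(weights_cfs, weights_cmfs):
    """Average the two filter weights for every feature scored by both filters."""
    shared = weights_cfs.keys() & weights_cmfs.keys()
    return {name: (weights_cfs[name] + weights_cmfs[name]) / 2 for name in shared}

def select_top_k(weights, k):
    """Return the k feature names with the highest averaged weight."""
    return sorted(weights, key=weights.get, reverse=True)[:k]

# Hypothetical weights produced by the two filters.
weights_cfs = {"loc": 0.80, "branch_count": 0.60, "comments": 0.10}
weights_cmfs = {"loc": 0.60, "branch_count": 0.70, "comments": 0.30}

multi_weights = average_weights(weights_cfs, weights_cmfs)
selected = select_top_k(multi_weights, k=2)
print(selected)
```

Because each averaged score is the plain mean of the two filter weights, a feature that only one filter rates highly can still be outranked by a feature that both filters rate moderately well.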


2.4. Classification with particle swarm optimization feature selection 

In this PSO phase, the filtered training datasets undergo 10-fold cross-validation using KNN 
as the classifier. The dataset is divided into 10 subsets, with each iteration utilizing 9 subsets for training and 
the remaining one for validation. The performance of the KNN model is assessed based on metrics like 
accuracy and precision, and the optimal hyperparameters are determined. Subsequently, the best-performing 
KNN model is applied to an independent testing subset. 
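
The cross-validation loop described above can be sketched with a small pure-Python KNN. This is an illustration only: the paper does not state the number of neighbors or the distance measure, so k=3 with Euclidean distance, the shuffling seed, and the synthetic two-cluster data are all assumptions.

```python
import math
import random

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training rows closest to the query (binary labels)."""
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if votes * 2 > k else 0

def k_fold_accuracy(x, y, folds=10, k=3, seed=0):
    """Mean validation accuracy over `folds` disjoint validation subsets."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    scores = []
    for f in range(folds):
        valid = order[f::folds]  # every folds-th shuffled row forms one fold
        valid_set = set(valid)
        train = [i for i in order if i not in valid_set]
        train_x = [x[i] for i in train]
        train_y = [y[i] for i in train]
        hits = sum(knn_predict(train_x, train_y, x[i], k) == y[i] for i in valid)
        scores.append(hits / len(valid))
    return sum(scores) / folds

# Synthetic, well-separated clusters standing in for non-defective (0) and defective (1) modules.
rows = [[i * 0.1, i * 0.1] for i in range(20)] + [[5 + i * 0.1, 5 + i * 0.1] for i in range(20)]
labels = [0] * 20 + [1] * 20

accuracy = k_fold_accuracy(rows, labels)
print(accuracy)
```

Each row is used for validation exactly once, and the model that scores it is always fitted on the other nine folds.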

PSO feature selection is employed to find the optimal feature subset, particularly effective for 
datasets with noisy attributes [30], [31]. In PSO, a population of particles represents feature subsets in a 
binary manner (1 for inclusion, 0 for exclusion). Particles are assessed based on machine learning model 
performance with their feature subsets. Particles adapt to personal best results (Pbest) and the overall best in 
the population (Gbest) during iterations, with PSO terminating when criteria are met to yield the best feature 
subset. PSO’s position and velocity changes derive from basic formulas (2) and (3) [10], [31]: 


$$x_i^{t+1} = x_i^t + v_i^{t+1} \tag{2}$$

$$v_i^{t+1} = v_i^t + c_1 r_1 \left(Pbest_i^t - x_i^t\right) + c_2 r_2 \left(Gbest^t - x_i^t\right) \tag{3}$$


In the PSO algorithm, coefficients c1 and c2 control particle movement toward Pbest and Gbest, 
balancing exploration and exploitation. Random variables r1 and r2 introduce stochasticity for efficient 
exploration. The adjustment of c1, c2, r1, and r2 fine-tunes the PSO algorithm’s performance for optimal 
problem-solving [32]. 
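
A single particle update following (2) and (3) might look like the sketch below. The values of c1, c2, r1, and r2 are arbitrary, r1 and r2 are passed in explicitly only to make the example reproducible (in practice they are drawn at random), and the thresholding used to turn a position into a binary feature mask is an assumption, since the exact binarization rule is not specified here.

```python
def update_particle(position, velocity, pbest, gbest, c1, c2, r1, r2):
    """Apply equation (3) for the new velocity, then equation (2) for the new position."""
    new_velocity = [
        v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
        for x, v, pb, gb in zip(position, velocity, pbest, gbest)
    ]
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity

def to_feature_mask(position, threshold=0.5):
    """Assumed binarization: include a feature when its coordinate exceeds the threshold."""
    return [1 if x > threshold else 0 for x in position]

# One particle over two candidate features (all numbers are illustrative).
position = [0.2, 0.8]
velocity = [0.1, -0.1]
pbest = [1.0, 0.8]
gbest = [1.0, 0.0]

position, velocity = update_particle(position, velocity, pbest, gbest,
                                     c1=2.0, c2=2.0, r1=0.5, r2=0.25)
mask = to_feature_mask(position)
print(position, velocity, mask)
```

The first coordinate is pulled toward both Pbest and Gbest and ends up selected, while the second is pulled toward Gbest's exclusion of that feature and ends up dropped.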


2.5. Performance evaluation 

To evaluate the findings, experiments involved PSO, PSO with CFS, PSO with CMFS, and PSO 
with MCFS. This comprehensive approach aimed to thoroughly understand and rigorously evaluate each 
approach’s significance. AUC values, crucial for assessing data classification, offer distinct metrics for 
performance evaluation [33]. Ranging from 0 to 1, where 1 signifies perfect separation and 0.5 implies 
random classification, higher AUC values indicate superior model performance [34]. Performance 
differences among approaches were evaluated using a t-Test, comparing their average model performance 
and determining the statistical significance of AUC differences [35]. Significant t-Test results boost 
confidence in one approach's superiority [35], [36]. An alpha of 0.05 was chosen, so the null 
hypothesis is rejected at the 95% confidence level. Although alpha levels can be adjusted, 0.05 is widely accepted as a 
pragmatic compromise [36]. 
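
To make the AUC interpretation concrete, the sketch below computes it as the fraction of (defective, non-defective) pairs in which the defective instance receives the higher score, with ties counted as half. The function name and the four-row example are illustrative and not taken from the experiments.

```python
def auc_score(labels, scores):
    """Probability that a random positive row outranks a random negative one (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one row of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical classifier scores for the defective class
auc = auc_score(labels, scores)
print(auc)
```

Three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75; a perfect ranking gives 1 and identical scores for all rows give 0.5.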

Algorithm 1 presents the pseudocode of the proposed method. It emphasizes the filtering 
process, namely the removal of highly correlated attributes and the averaging of the CFS and CMFS weights, 
which is the component unique to our approach. The listing gives an overview of the workflow 
and is intended to aid in the implementation of our proposed method. 

Algorithm 1. Pseudocode of the proposed method 
Begin
  NasaData = Datasets.GetNasaMDP()
  WeightsCFS[] = CorrelationFS.GetFeatureWeights(NasaData)
  WeightsCMFS[] = CorrelationMatrixFS.GetFeatureWeights(NasaData)
  NasaDataNoHighCorr = Correlation.RemoveHighCorr(NasaData, minCorr=0.95)
  NamesOfFeature[] = NasaDataNoHighCorr.GetColumnNames()
  MultiWeights = []
  For feature in NamesOfFeature do
    MultiWeights[feature] = (WeightsCFS[feature] + WeightsCMFS[feature]) / 2
  EndFor
  NasaDataMultiFilter = FeatureSelectionByWeights(NasaData, MultiWeights, method="top", k=10)
  Train, Test = SplitData(NasaDataMultiFilter, train=0.8, test=0.2)
  PSOPerformance = PSO.FeatureSelection(model=CrossVal.KFold(model=KNN(data=Train), k=10), data=Test)
  Performance = PSOPerformance.GetAUC()
End


3. RESULTS AND DISCUSSION 

This study investigates the effectiveness of MCFS in addressing challenges associated with 
high-dimensional datasets within PSO, particularly concerning issues such as premature convergence and 
inefficacy with noisy attributes as previously noted [5]-[7]. The primary objective is to evaluate whether the 
integration of MCFS leads to significantly improved software defect prediction (SDP) results while 
maintaining alignment with the chosen alpha value in t-Test comparisons. The results in Table 2, 
which reports AUC values from 48 experiments across 12 datasets, show that PSO integrated with MCFS 
achieves the highest AUC on 11 of the 12 datasets (PC5 is the exception) and the highest average AUC compared to single-filter integration and no filtering. This 
finding underscores the potential of MCFS to enhance the performance of PSO in handling high-dimensional 
datasets, making it a noteworthy advancement in addressing pertinent challenges in optimization processes. 
Additional insights can be obtained from Figure 2, offering a detailed graphical representation of this 
methodology. The datasets are depicted along the horizontal axis, with the AUC values for each dataset 
represented on the vertical axis. Each model is differentiated by a distinct color, visualized through bars. 

The significance of our experiments is assessed through the t-Test results in Table 3. The 
difference between using PSO alone and combining PSO with CFS or 
CMFS is not notably distinct, consistent with the average performance values in Table 2. By contrast, 
our proposed method differs significantly from 
PSO alone, as well as from the configurations involving PSO with CFS and PSO with CMFS, with all values in Table 3 below the chosen alpha of 0.05. This 
level of significance underscores the effectiveness of our proposed method, highlighting its potential to 
outperform not only standalone PSO but also combinations with specific feature selection techniques such as 
CFS and CMFS. 

The superiority of our proposed method is further emphasized by the comparative analysis presented 
in Table 4. These findings, as demonstrated in the tables, validate the substantial impact introduced by our 
proposed method across all observed models. The average AUC values demonstrate that our method 
outperforms methodologies employed in other studies. Our method stands out, promising superior 
performance over standalone PSO and configurations with specific feature selection techniques. The 
demonstrated superiority in both significance and performance across diverse models underscores its 
potential as a valuable enhancement in the realm of data analysis and feature selection. 


Table 2. AUC performance of every experiment 
Datasets  PSO    PSO CFS  PSO CMFS  PSO MCFS
CM1       0.72   0.815    0.813     0.891
JM1       0.677  0.689    0.673     0.704
KC1       0.768  0.724    0.738     0.784
KC3       0.855  0.863    0.899     0.95
MC1       0.836  0.851    0.811     0.949
MC2       0.877  0.91     0.92      0.984
MW1       0.901  0.881    0.891     0.942
PC1       0.911  0.913    0.927     0.95
PC2       0.792  0.782    0.882     0.896
PC3       0.83   0.844    0.827     0.872
PC4       0.919  0.939    0.905     0.955
PC5       0.841  0.822    0.781     0.82
Average   0.827  0.836    0.839     0.891

Figure 2. A graphical comparison of AUC performance for every experiment 


Table 3. The t-Test results for every method tested 
Method comparison      t-Test value (α=0.05)  Significance
PSO MCFS vs. PSO       0.0008                 Significant
PSO MCFS vs. PSO CFS   0.0001                 Significant
PSO MCFS vs. PSO CMFS  0.0001                 Significant
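
As a rough cross-check of the first comparison in Table 3, a t statistic can be recomputed from the per-dataset AUC values in Table 2. The sketch below assumes a paired test over the 12 datasets, which is not stated explicitly in the text, and compares the statistic against the standard two-tailed critical value for 11 degrees of freedom instead of computing a p-value; the function name is hypothetical.

```python
import math

# Per-dataset AUC values from Table 2 (CM1, JM1, KC1, KC3, MC1, MC2, MW1, PC1, PC2, PC3, PC4, PC5).
pso = [0.72, 0.677, 0.768, 0.855, 0.836, 0.877, 0.901, 0.911, 0.792, 0.83, 0.919, 0.841]
pso_mcfs = [0.891, 0.704, 0.784, 0.95, 0.949, 0.984, 0.942, 0.95, 0.896, 0.872, 0.955, 0.82]

def paired_t_statistic(baseline, candidate):
    """t statistic of the mean per-dataset difference (candidate minus baseline)."""
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n)

t_value = paired_t_statistic(pso, pso_mcfs)

# Two-tailed critical value of Student's t for alpha = 0.05 and 11 degrees of freedom.
T_CRITICAL = 2.201
significant = abs(t_value) > T_CRITICAL
print(round(t_value, 3), significant)
```

Under these assumptions the statistic is well above the critical value, which is consistent with the significant result reported for PSO MCFS against PSO alone.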


Table 4. Proposed method comparison against other studies’ method 


Study                     Method        Avg. AUC  Proposed method Avg. AUC
Muthukumaran et al. [37]  LR CFS        0.812     0.891
Kalsoom et al. [38]       MLP FLDA-FS   0.866     0.891
Iqbal and Aftab [39]      MLP MFFS ROS  0.817     0.891

4. CONCLUSION 

In this paper, MCFS is proposed, designed to address challenges associated with noisy attribute data 
and high-dimensional datasets, thereby enhancing predictive performance within the domain of SDP. 
The experiments involved a comparative analysis of MCFS against various techniques, 
including the independent application of PSO, PSO coupled with CFS, and PSO combined with CMFS. The 
outcomes of these experiments affirm the superiority of the proposed MCFS approach in mitigating issues 
and improving predictive modeling. 

The proposed method consistently achieves superior classification performance, boasting an average 
AUC of 0.891. In contrast, when PSO is applied independently, it yields a notably lower AUC of 0.827. 
Furthermore, the integration of PSO with CFS results in a modest improvement, with an AUC of 0.836. 
Notably, when PSO is coupled with CMFS, the AUC increases to 0.839. Additionally, statistical significance 
was confirmed through a t-Test comparing the performance of the proposed MCFS method with the 
alternative approaches. The resulting p-value was less than 0.05, signifying a statistically significant 
difference between MCFS and the other methods. This observation becomes more pronounced when 
comparing MCFS with methodologies employed in other studies, underscoring MCFS’s superior 
performance. This statistical analysis further reinforces the assertion that MCFS is a robust solution for 
mitigating challenges associated with noisy attributes and high-dimensional data in SDP, all while enhancing 
predictive performance. 